Abstract. Arc-annotated sequences are useful in representing the structural
information of RNA and protein sequences. The longest arc-preserving common
subsequence (LAPCS) problem has been introduced as a framework for studying
the similarity of arc-annotated sequences. In this paper, we consider the
LAPCS problem for arc-annotated sequences with various arc structures. In
particular, we show that the decision versions of 1-FRAGMENT LAPCS(CROSSING,
CHAIN) and 0-DIAGONAL LAPCS(CROSSING, CHAIN) are NP-complete for some fixed
alphabet $\Sigma$ such that $|\Sigma| = 2$. We also show that if
$|\Sigma| = 1$, then the decision versions of 1-FRAGMENT LAPCS(UNLIMITED,
PLAIN) and 0-DIAGONAL LAPCS(UNLIMITED, PLAIN) are NP-complete.

1 Introduction 



Algorithms on sequences of symbols have been studied for a long time and
now form a fundamental part of computer science. One of the most important
problems in the analysis of sequences is the longest common subsequence (LCS)
problem. The computational problem of finding a longest common subsequence
of a set of k strings has been studied extensively over the last thirty
years (see [5, 19, 21] and the references therein). This problem has many
applications.

Computing Classification System 1998: F.1.3 
Mathematics Subject Classification 2010: 68Q15 

Key words and phrases: longest common subsequence, sequence annotation,
NP-complete

V. Popov 



When k = 2, the longest common subsequence is a measure of the similarity 
of two strings and is thus useful in molecular biology, pattern recognition, and 
text compression [26, 27, 34]. The version of LCS in which the number of 
strings is unrestricted is also useful in text compression [27], and is a special 
case of the multiple sequence alignment and consensus subsequence discovery 
problem in molecular biology [11, 12, 32]. 

The k-unrestricted LCS problem is NP-complete [27]. If the number of
sequences is fixed at k and their maximum length is n, a longest common
subsequence can be found in $O(n^k)$ time through an extension of the pairwise
algorithm [21]. If $|S_1| = n$ and $|S_2| = m$, a longest common subsequence
of $S_1$ and $S_2$ can be found in $O(nm)$ time [8, 18, 35].
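As an illustration of the $O(nm)$ bound for two sequences, the following is a minimal sketch of the textbook dynamic-programming table. The function name `lcs_length` is chosen here for illustration and is not taken from the cited works.

```python
def lcs_length(s1: str, s2: str) -> int:
    """Length of a longest common subsequence of s1 and s2 in O(nm) time."""
    n, m = len(s1), len(s2)
    # table[i][j] holds the LCS length of the prefixes s1[:i] and s2[:j].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                # Matching letters extend the best solution of both shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop the last letter of one of the two prefixes.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]
```

The table has $(n+1)(m+1)$ entries and each is filled in constant time, which gives the stated bound.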

Sequence-level investigation has become essential in modern molecular
biology. However, treating genetic molecules only as long sequences over the
four basic constituents is too simple to determine the function and physical
structure of the molecules, so additional information should be attached to
the sequences. Early work using such additional information was
primary-structure based: the sequence comparison was done essentially on the
primary structure while trying to incorporate secondary structure data [2, 9].
This approach has the weakness that it does not treat a base pair as a whole
entity. Recently, an improved model was proposed [13, 14].

Arc-annotated sequences are useful in describing the secondary and tertiary
structures of RNA and protein sequences. See [13, 4, 16, 22, 23] for further
discussion and references. Structure comparison for RNA and for protein
sequences has become a central computational problem that raises many
challenging computer science questions. In this context, the longest
arc-preserving common subsequence (LAPCS) problem has recently received
considerable attention [13, 14, 22, 23, 25]. It is a sound and meaningful
mathematical formalization of comparing the secondary structures of molecular
sequences. Studies of this problem have been undertaken in [5, 16, 1, 3, 6,
7, 10, 15, 20, 28, 29, 30, 33].



2 Preliminaries and problem definitions 

Given two sequences S and T over some fixed alphabet $\Sigma$, the sequence T
is a subsequence of S if T can be obtained from S by deleting some letters
from S. Notice that the order of the remaining letters of S must be preserved.
The length of a sequence S is the number of letters in it and is denoted by
$|S|$. For simplicity, we use $S[i]$ to denote the $i$th letter in sequence S,
and $S[i, j]$ to denote the substring of S consisting of the $i$th letter
through the $j$th letter.

Given two sequences $S_1$ and $S_2$ (over some fixed alphabet $\Sigma$), the
classic longest common subsequence problem asks for a longest sequence T that
is a subsequence of both $S_1$ and $S_2$.

An arc-annotated sequence of length n on a finite alphabet $\Sigma$ is a couple
$A = (S, P)$, where S is a sequence of length n on $\Sigma$ and P is a set of
pairs $(i_1, i_2)$ with $1 \le i_1 < i_2 \le n$. In this paper we call an
element of S a base. A pair $(i_1, i_2) \in P$ represents an arc linking the
bases $S[i_1]$ and $S[i_2]$ of S. The bases $S[i_1]$ and $S[i_2]$ are said to
belong to the arc $(i_1, i_2)$ and are the only bases that belong to this arc.

Given two annotated sequences $S_1$ and $S_2$ with arc sets $P_1$ and $P_2$
respectively, a common subsequence T of $S_1$ and $S_2$ induces a bijective
mapping from a subset of $\{1, \ldots, |S_1|\}$ to a subset of
$\{1, \ldots, |S_2|\}$. The common subsequence T is arc-preserving if the arcs
induced by the mapping are preserved, i.e., for any $(i_1, j_1)$ and
$(i_2, j_2)$ in the mapping,

\[ (i_1, i_2) \in P_1 \iff (j_1, j_2) \in P_2. \]
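The arc-preservation condition can be checked directly on a given mapping. The sketch below assumes a representation that is not part of the paper: the mapping is a list of 1-based position pairs `(i, j)`, and each arc set is a collection of pairs of positions. The name `is_arc_preserving` is illustrative only.

```python
def is_arc_preserving(mapping, arcs1, arcs2):
    """Check that (i1, i2) is an arc of P1 exactly when (j1, j2) is an arc of P2.

    mapping: list of (i, j) pairs, meaning S1[i] is matched with S2[j].
    arcs1, arcs2: iterables of position pairs (the arc sets P1 and P2).
    """
    # Arcs are unordered here, so endpoints are compared as sets.
    p1 = {frozenset(arc) for arc in arcs1}
    p2 = {frozenset(arc) for arc in arcs2}
    for a in range(len(mapping)):
        for b in range(a + 1, len(mapping)):
            (i1, j1), (i2, j2) = mapping[a], mapping[b]
            if (frozenset((i1, i2)) in p1) != (frozenset((j1, j2)) in p2):
                return False
    return True
```

The check only tests the arc condition; it assumes the mapping already corresponds to a common subsequence (matched letters are equal and the order is preserved).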

The LAPCS problem is to find a longest common subsequence of $S_1$ and
$S_2$ that is arc-preserving (with respect to the given arc sets $P_1$ and
$P_2$) [13].

LAPCS:

Instance: An alphabet $\Sigma$, annotated sequences $S_1$ and $S_2$,
$S_1, S_2 \in \Sigma^*$, with arc sets $P_1$ and $P_2$ respectively.

Question: Find a longest common subsequence of $S_1$ and $S_2$ that is
arc-preserving.

The arc structure can be restricted. We consider the following four natural 
restrictions on an arc set P which are first discussed in [13]: 

1. no sharing of endpoints:

$\forall (i_1, i_2), (i_3, i_4) \in P$: $i_1 \ne i_4$, $i_2 \ne i_3$, and $i_1 = i_3 \iff i_2 = i_4$.

2. no crossing:

$\forall (i_1, i_2), (i_3, i_4) \in P$: $i_1 \in [i_3, i_4] \iff i_2 \in [i_3, i_4]$.

3. no nesting:

$\forall (i_1, i_2), (i_3, i_4) \in P$: $i_1 \le i_3 \iff i_2 \le i_3$.

4. no arcs:

$P = \emptyset$.

These restrictions are used progressively and inclusively to produce five 
distinct levels of permitted arc structures for LAPCS: 

- UNLIMITED: no restrictions;

- CROSSING: restriction 1;

- NESTED: restrictions 1 and 2;

- CHAIN: restrictions 1, 2 and 3;

- PLAIN: restriction 4.
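The five levels can be illustrated with a small classifier that returns the most restrictive level an arc set satisfies. This is a sketch under assumptions not made in the paper: arcs are given as pairs `(i1, i2)` with `i1 < i2`, and the function name `arc_level` is hypothetical.

```python
from itertools import combinations

def arc_level(arcs):
    """Return the most restrictive level satisfied by a set of arcs (i1, i2), i1 < i2."""
    arcs = sorted(set(arcs))
    if not arcs:
        return "plain"                       # restriction 4: no arcs
    endpoints = [e for arc in arcs for e in arc]
    if len(endpoints) != len(set(endpoints)):
        return "unlimited"                   # restriction 1 fails: a shared endpoint
    # With distinct endpoints and sorted arcs, every pair satisfies a < c.
    pairs = list(combinations(arcs, 2))
    if any(c < b < d for (a, b), (c, d) in pairs):
        return "crossing"                    # restriction 2 fails: a < c < b < d
    if any(d < b for (a, b), (c, d) in pairs):
        return "nested"                      # restriction 3 fails: a < c < d < b
    return "chain"                           # restrictions 1, 2 and 3 all hold
```

For example, the arcs `(1, 4)` and `(2, 3)` are nested but do not cross, so the set is at level NESTED and not at level CHAIN.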

The problem LAPCS is varied by these different levels of restrictions as
LAPCS(x, y), which is problem LAPCS with $S_1$ having restriction level x and
$S_2$ having restriction level y. Without loss of generality, we always assume
that x is the same level as or higher than y.

We give the definitions of two special cases of the LAPCS problem, which
were first studied in [25]. The special cases are motivated by biological
applications [17, 24].

The c-FRAGMENT LAPCS problem ($c \ge 1$):

Instance: An alphabet $\Sigma$, annotated sequences $S_1$ and $S_2$,
$S_1, S_2 \in \Sigma^*$, with arc sets $P_1$ and $P_2$ respectively, where
$S_1$ and $S_2$ are divided into fragments of length exactly c (the last
fragment can have a length less than c).

Question: Find a longest common subsequence of $S_1$ and $S_2$ that is
arc-preserving. The allowed matches are those between fragments at the same
location.

The c-DIAGONAL LAPCS problem ($c \ge 0$) is an extension of the c-FRAGMENT
LAPCS problem, where base $S_2[i]$ is allowed only to match bases in the
range $S_1[i - c, i + c]$.

The c-DIAGONAL LAPCS and c-FRAGMENT LAPCS problems are relevant in the
comparison of conserved RNA sequences where we already have a rough idea
about the correspondence between bases in the two sequences.
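The two match restrictions can be written as simple predicates on 1-based positions. The sketch below is only an illustration of the definitions above; the function names are hypothetical, and fragments are assumed to be numbered consecutively from the start of each sequence.

```python
def fragment_allows(i: int, j: int, c: int) -> bool:
    """c-fragment rule: S1[i] and S2[j] must lie in fragments at the same location."""
    return (i - 1) // c == (j - 1) // c

def diagonal_allows(i: int, j: int, c: int) -> bool:
    """c-diagonal rule: S2[j] may match only bases in the range S1[j - c, j + c]."""
    return abs(i - j) <= c
```

With `c = 1` in the first predicate and `c = 0` in the second, both reduce to `i == j`, so only bases at identical positions may be matched.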

3 Previous results 

It is shown in [25] that 1-FRAGMENT LAPCS(CROSSING, CROSSING) and
0-DIAGONAL LAPCS(CROSSING, CROSSING) are solvable in time $O(n)$. An
overview of known NP-completeness results for c-DIAGONAL LAPCS and
c-FRAGMENT LAPCS is given in Figure 1.

Figure 1: NP-completeness results for c-DIAGONAL LAPCS (with $c \ge 1$) and
c-FRAGMENT LAPCS (with $c \ge 2$)

4 The c-FRAGMENT LAPCS(UNLIMITED, PLAIN) and the c-DIAGONAL LAPCS(UNLIMITED, PLAIN) problems

Let us consider the decision version of the c-FRAGMENT LAPCS problem.

Instance: An alphabet $\Sigma$, a positive integer k, annotated sequences
$S_1$ and $S_2$, $S_1, S_2 \in \Sigma^*$, with arc sets $P_1$ and $P_2$
respectively, where $S_1$ and $S_2$ are divided into fragments of length
exactly c (the last fragment can have a length less than c).

Question: Is there an arc-preserving common subsequence T of $S_1$ and $S_2$
with $|T| \ge k$? (The allowed matches are those between fragments at the
same location.)

Similarly, we can define the decision version of the c-DIAGONAL LAPCS
problem.

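Theorem 1 If $|\Sigma| = 1$, then 1-FRAGMENT LAPCS(UNLIMITED, PLAIN) and
0-DIAGONAL LAPCS(UNLIMITED, PLAIN) are NP-complete.

Proof. It is easy to see that 1-FRAGMENT LAPCS(UNLIMITED, PLAIN) =
0-DIAGONAL LAPCS(UNLIMITED, PLAIN): with c = 1 every fragment is a single
position, and with c = 0 the base $S_2[i]$ can match only $S_1[i]$, so in both
problems only bases at identical positions may be matched.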

Let $G = (V, E)$ be an undirected graph, and let $I \subseteq V$. We say that
the set I is independent if whenever $i, j \in I$ then there is no edge
between i and j. We make use of the following problem:

INDEPENDENT SET (IS):

Instance: A graph $G = (V, E)$, a positive integer k.

Question: Is there an independent set I, $I \subseteq V$, with $|I| \ge k$?

IS is NP-complete (see [31]).

Let us suppose that $\Sigma = \{a\}$. We will show that IS can be polynomially
reduced to problem 1-FRAGMENT LAPCS(UNLIMITED, PLAIN).

Let $(G = (V, E), V = \{1, 2, \ldots, n\}, k)$ be an instance of IS. Now we
transform an instance of the IS problem to an instance of the 1-FRAGMENT
LAPCS(UNLIMITED, PLAIN) problem as follows.

• $S_1 = S_2 = a^n$.

• $P_1 = E$, $P_2 = \emptyset$.

• $((S_1, P_1), (S_2, P_2), k)$.

First suppose that the graph G has an independent set I of size k. By the
definition of an independent set, $(i, j) \notin E$ for each $i, j \in I$. For
the given subset I, let

\[ M = \{(i, i) : i \in I\}. \]

Since I is an independent set, if $(i, j) \in E = P_1$ then either
$(i, i) \notin M$ or $(j, j) \notin M$. This preserves arcs since $P_2$ is
empty. Clearly, $S_1[i] = S_2[i]$ for each $i \in I$, and the allowed matches
are those between fragments at the same location. Therefore, there is a common
subsequence T of $S_1$ and $S_2$ that is arc-preserving, $|T| = k$, and the
allowed matches are those between fragments at the same location.

Now suppose that there is a common subsequence T of $S_1$ and $S_2$ that is
arc-preserving, $|T| = k$, and the allowed matches are those between fragments
at the same location. In this case there is a valid mapping M with $|M| = k$.
Since c = 1, it is easy to see that if $(i, j) \in M$ then $i = j$. Let

\[ I = \{i : (i, i) \in M\}. \]

Clearly,

\[ |I| = |M| = k. \]

Let $i_1$ and $i_2$ be any two distinct members of I. Then let
$(i_1, j_1), (i_2, j_2) \in M$. Since

\[ i_1 = j_1,\quad i_2 = j_2,\quad i_1 \ne i_2, \]

it is easy to see that $j_1 \ne j_2$. Since $P_2$ is empty,
$(j_1, j_2) \notin P_2$, so $(i_1, i_2) \notin P_1$. Since $P_1 = E$, the set
I of vertices is a size k independent set of G. □
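The reduction in the proof of Theorem 1 can be checked by brute force on small graphs. The following sketch is not part of the paper: the function names are hypothetical, vertices are assumed to be numbered 1 to n, and both checks are exponential-time and intended only for tiny instances. Because every subset of an arc-preserving set of matched positions is again arc-preserving, it is enough to search for exactly k matches.

```python
from itertools import combinations

def reduce_is_to_lapcs(n, edges, k):
    """Map an IS instance (vertices 1..n, edge list, k) to ((S1, P1), (S2, P2), k)."""
    s = "a" * n                                  # S1 = S2 = a^n
    p1 = {tuple(sorted(e)) for e in edges}       # P1 = E, stored as (small, large)
    return (s, p1), (s, set()), k                # P2 is empty

def has_independent_set(n, edges, k):
    """Brute-force test for an independent set of size k."""
    edge_set = {tuple(sorted(e)) for e in edges}
    return any(
        all(pair not in edge_set for pair in combinations(cand, 2))
        for cand in combinations(range(1, n + 1), k)
    )

def has_one_fragment_lapcs(instance):
    """Brute-force test for k arc-preserving matches when only S1[i]-S2[i] is allowed."""
    (s1, p1), (s2, p2), k = instance
    same = [i for i in range(1, len(s1) + 1) if s1[i - 1] == s2[i - 1]]
    return any(
        all(((a, b) in p1) == ((a, b) in p2) for a, b in combinations(chosen, 2))
        for chosen in combinations(same, k)
    )
```

On a path with three vertices and on a triangle, the two tests agree for every k from 1 to 3, as the proof predicts.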

5 The c-FRAGMENT LAPCS(CROSSING, CHAIN) and the c-DIAGONAL LAPCS(CROSSING, CHAIN) problems

Theorem 2 If $|\Sigma| = 2$, then 1-FRAGMENT LAPCS(CROSSING, CHAIN) and
0-DIAGONAL LAPCS(CROSSING, CHAIN) are NP-complete.

Proof. It is easy to see that 1-FRAGMENT LAPCS(CROSSING, CHAIN) =
0-DIAGONAL LAPCS(CROSSING, CHAIN).

Let us suppose that $\Sigma = \{a, b\}$. We will show that IS can be
polynomially reduced to problem 1-FRAGMENT LAPCS(CROSSING, CHAIN).

Let $(G = (V, E), V = \{1, 2, \ldots, n\}, k)$ be an instance of IS. Note that
IS remains NP-complete when restricted to connected graphs with no loops or
multiple edges. Let $G = (V, E)$ be such a graph. Now we transform an instance
of the IS problem to an instance of the 1-FRAGMENT LAPCS(CROSSING, CHAIN)
problem as follows.
There are two cases to consider.

Case I. $k > n$

• $S_1 = S_2 = a$

• $P_1 = P_2 = \emptyset$

• $((S_1, P_1), (S_2, P_2), k)$

Clearly, if I is an independent set, then $I \subseteq V$ and
$|I| \le |V| = n$. Therefore, there is no independent set I with
$|I| \ge k$.

Since $k > n$ and $n \in \{1, 2, \ldots\}$, it is easy to see that $k > 1$.
Since $S_1 = S_2 = a$ and $P_1 = P_2 = \emptyset$, $T = a$ is the longest
arc-preserving common subsequence. Therefore, there is no arc-preserving
common subsequence T such that $|T| \ge k$.

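Case II. $k \le n$

• $S_1 = S_2 = (b a^n b)^n$

• Let $\alpha < \beta$. Then

\[ (\alpha, \beta) \in P_1 \iff [\exists i \in \{1, 2, \ldots, n\}\ \exists j \in \{1, 2, \ldots, n\}\ ((i, j) \in E \wedge \alpha = (i-1)(n+2) + j + 1 \wedge \beta = (j-1)(n+2) + i + 1)] \vee [\exists i \in \{1, 2, \ldots, n\}\ (\alpha = (i-1)(n+2) + 1 \wedge \beta = i(n+2))], \]

\[ (\alpha, \beta) \in P_2 \iff \exists i \in \{1, 2, \ldots, n\}\ (\alpha = (i-1)(n+2) + 1 \wedge \beta = i(n+2)). \]

• $((S_1, P_1), (S_2, P_2), k(n+2))$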
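The Case II construction can be made concrete with a short sketch. It is an illustration under assumptions that are not in the paper: vertices are numbered 1 to n, edges are given as a list of pairs, arcs are stored as pairs with the smaller position first, and all function names are hypothetical. The sketch only builds the instance and checks the forward direction, namely that an independent set yields an arc-preserving set of identical-position matches of size k(n + 2); it does not test the converse.

```python
def build_case2_instance(n, edges, k):
    """Build ((S1, P1), (S2, P2), k(n+2)) from an IS instance with vertices 1..n."""
    w = n + 2                                          # length of one block b a^n b
    s = ("b" + "a" * n + "b") * n                      # S1 = S2 = (b a^n b)^n
    # One arc per block, joining its first and last letter b (this is P2).
    block_arcs = {((i - 1) * w + 1, i * w) for i in range(1, n + 1)}
    p1 = set(block_arcs)
    for i, j in edges:
        alpha = (i - 1) * w + j + 1                    # j-th letter a of block i
        beta = (j - 1) * w + i + 1                     # i-th letter a of block j
        p1.add((min(alpha, beta), max(alpha, beta)))
    return (s, p1), (s, block_arcs), k * w

def positions_from_independent_set(n, ind_set):
    """Positions j with (j, j) in M: every position of block i for each i in I."""
    w = n + 2
    return [(i - 1) * w + l for i in sorted(ind_set) for l in range(1, w + 1)]

def preserves_arcs(positions, p1, p2):
    """Arc check for identical-position matches (j, j)."""
    chosen = sorted(positions)
    return all(
        ((a, b) in p1) == ((a, b) in p2)
        for x, a in enumerate(chosen)
        for b in chosen[x + 1:]
    )
```

For the path 1-2-3 with n = 3 and the independent set {1, 3}, the blocks of vertices 1 and 3 give 10 = 2(3 + 2) matched positions, and neither edge arc has both endpoints among them.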

First suppose that G has an independent set I of size k. By the definition of
an independent set, $(i, j) \notin E$ for each $i, j \in I$. For the given
subset I, let

\[ M = \{(j, j) : j = (n+2)(i-1) + l,\ i \in I,\ l \in \{1, 2, \ldots, n+2\}\}. \]

Let $(j, j) \in M$, and suppose there exists i such that
$j = (n+2)(i-1) + 1$. By the definition of M,

\[ ((n+2)(i-1) + 1, (n+2)(i-1) + 1) \in M \Rightarrow ((n+2)i, (n+2)i) \in M. \]

By the definition of $P_l$, $((n+2)(i-1) + 1, (n+2)i) \in P_l$, where
$l = 1, 2$. Let $(j, j) \in M$, and suppose there exists i such that
$j = (n+2)i$. By the definition of M,

\[ ((n+2)i, (n+2)i) \in M \Rightarrow ((n+2)(i-1) + 1, (n+2)(i-1) + 1) \in M. \]

By the definition of $P_l$,

\[ ((n+2)(i-1) + 1, (n+2)i) \in P_l, \]

where $l = 1, 2$. Let $(j, j) \in M$, and

\[ j = (n+2)(i-1) + l, \]

where $1 < l < n+2$. By the definition of M, $i \in I$. Since I is an
independent set, if $(i, l-1) \in E$ then $l - 1 \notin I$. Since

\[ 1 < l < n+2, \]

by the definition of $P_1$, either

\[ ((n+2)(i-1) + l, (n+2)(l-2) + i + 1) \in P_1 \]

or

\[ ((n+2)(i-1) + l, t) \notin P_1 \]

for each t. Since

\[ 1 < l < n+2, \]

by the definition of $P_2$,

\[ ((n+2)(i-1) + l, t) \notin P_2 \]

for each t. If

\[ ((n+2)(i-1) + l, (n+2)(l-2) + i + 1) \in P_1, \]

then in view of $l - 1 \notin I$,

\[ ((n+2)(l-2) + i + 1, (n+2)(l-2) + i + 1) \notin M. \]

This preserves arcs. Since $|I| = k$, it is easy to see that

\[ |M| = k(n+2). \]

Clearly, $S_1[j] = S_2[j]$ for each $(j, j) \in M$, and the allowed matches are
those between fragments at the same location. Therefore, there is a common
subsequence T of $S_1$ and $S_2$ that is arc-preserving, $|T| = k(n+2)$, and
the allowed matches are those between fragments at the same location.

Now suppose that there is a common subsequence T of $S_1$ and $S_2$ that is
arc-preserving, $|T| = k$, and the allowed matches are those between fragments
at the same location. In this case there is a valid mapping M with $|M| = k$.
Since c = 1, it is easy to see that if $(i, j) \in M$ then $i = j$. Let
$I = \{i : (i, i) \in M\}$. Clearly, $|I| = |M| = k$. Let $i_1$ and $i_2$ be
any two distinct members of I. Then let $(i_1, j_1), (i_2, j_2) \in M$. Since
$i_1 = j_1$, $i_2 = j_2$, $i_1 \ne i_2$, it is easy to see that
$j_1 \ne j_2$. Since $P_2$ is empty, $(j_1, j_2) \notin P_2$, so
$(i_1, i_2) \notin P_1$. Since $P_1 = E$, the set I of vertices is a size k
independent set of G. □



6 Conclusions 

In this paper, we considered two special cases of the LAPCS problem, which
were first studied in [25]. We have shown that the decision versions of
1-FRAGMENT LAPCS(CROSSING, CHAIN) and 0-DIAGONAL LAPCS(CROSSING, CHAIN) are
NP-complete for some fixed alphabet $\Sigma$ such that $|\Sigma| = 2$. We have
also shown that if $|\Sigma| = 1$, then the decision versions of 1-FRAGMENT
LAPCS(UNLIMITED, PLAIN) and 0-DIAGONAL LAPCS(UNLIMITED, PLAIN) are
NP-complete. These results answer some open questions in [16] (see Table 4.2
in [16]).